SALEPRICE PREDICTION FOR HOUSES

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import xgboost as xgb
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 4
import plotly.express as px
from sklearn.model_selection import GridSearchCV
from pandas.api.types import is_string_dtype
from pandas.api.types import is_numeric_dtype

READING AND JOINING THE TRAIN AND TEST DATA.

In [2]:
# Reading the train dataset
House_data=pd.read_csv("train.csv")
In [3]:
House_data.head()
Out[3]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

In [4]:
# Reading the test data set
House_data_test=pd.read_csv("test.csv")
In [5]:
House_data_test.head()
Out[5]:
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

In [6]:
 # Merging the test and train datasets, so that all the cleaning can be done at once.
House_data["flag"]="0"
House_data_test["flag"]="1"
final_house_data=pd.concat([House_data,House_data_test])
A:\Anaconda\lib\site-packages\ipykernel_launcher.py:4: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  after removing the cwd from sys.path.
In [7]:
# Removing these two columns since they have too many null values.
final_house_data.drop(["LotFrontage","GarageYrBlt"],axis=1,inplace=True)

There are two types of categorical variables. Ordinal variables have categories with an inherent order, for instance the quality grades or ratings of a product. Nominal variables have categories with no such order. Two separate lists are made below, one for each type.

In [8]:
Ordinal_categorical=["MSSubClass","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Condition1"
                    ,"BldgType","HouseStyle","RoofStyle","RoofMatl","Exterior1st","MasVnrType","ExterQual","ExterCond","Foundation"
                    ,"BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir",
                    "Electrical","KitchenQual","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive",
                    "PoolQC","Fence"]
Nominal_categorical=["MSZoning","MiscFeature","SaleType","SaleCondition"]
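To make the distinction concrete, the two types call for different encodings. A minimal sketch on a toy frame (the column names are from this dataset, but the frame and the `_enc` suffix are hypothetical): an ordinal column gets an explicit order-preserving mapping, while a nominal one gets one-hot columns.

```python
import pandas as pd

# Hypothetical frame with one ordinal and one nominal column
df = pd.DataFrame({"ExterQual": ["TA", "Gd", "Ex", "Fa"],
                   "MSZoning": ["RL", "RM", "RL", "FV"]})

# Ordinal: encode with an explicit order (Po < Fa < TA < Gd < Ex)
quality_order = {"Po": 0, "Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
df["ExterQual_enc"] = df["ExterQual"].map(quality_order)

# Nominal: one-hot encode, since the categories carry no order
df = pd.get_dummies(df, columns=["MSZoning"])
print(df["ExterQual_enc"].tolist())  # [2, 3, 4, 1]
```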

Here we remove the null values by imputing them with reasonable values. For the categorical columns, nulls are replaced with 0 (for integer-coded columns) or the string "unknown"; for the numeric columns, nulls are replaced with the column mean.

In [9]:
def nullvalueremovecategoricalcolumns(df_nullcheck,cols):
    for columns in cols:
        # Integer-coded columns get 0; string columns get "unknown"
        if (df_nullcheck[columns].dtypes=='int64')|(df_nullcheck[columns].dtypes=='int32'):
            df_nullcheck[columns].fillna(0,inplace=True)
        else:
            df_nullcheck[columns].fillna("unknown",inplace=True)
    print(df_nullcheck[cols].head())
    return df_nullcheck
        
    
In [10]:
Housedata_nullcheck=nullvalueremovecategoricalcolumns(final_house_data,["MSZoning","MiscFeature","SaleType","SaleCondition","MSSubClass","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Condition1"
                    ,"BldgType","Condition2","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation"
                    ,"BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir",
                    "Electrical","KitchenQual","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive",
                    "PoolQC","Fence","Functional","Neighborhood"])
  MSZoning MiscFeature SaleType SaleCondition  MSSubClass Street    Alley  \
0       RL     unknown       WD        Normal          60   Pave  unknown   
1       RL     unknown       WD        Normal          20   Pave  unknown   
2       RL     unknown       WD        Normal          60   Pave  unknown   
3       RL     unknown       WD       Abnorml          70   Pave  unknown   
4       RL     unknown       WD        Normal          60   Pave  unknown   

  LotShape LandContour Utilities  ... FireplaceQu GarageType GarageFinish  \
0      Reg         Lvl    AllPub  ...     unknown     Attchd          RFn   
1      Reg         Lvl    AllPub  ...          TA     Attchd          RFn   
2      IR1         Lvl    AllPub  ...          TA     Attchd          RFn   
3      IR1         Lvl    AllPub  ...          Gd     Detchd          Unf   
4      IR1         Lvl    AllPub  ...          TA     Attchd          RFn   

  GarageQual GarageCond PavedDrive   PoolQC    Fence Functional Neighborhood  
0         TA         TA          Y  unknown  unknown        Typ      CollgCr  
1         TA         TA          Y  unknown  unknown        Typ      Veenker  
2         TA         TA          Y  unknown  unknown        Typ      CollgCr  
3         TA         TA          Y  unknown  unknown        Typ      Crawfor  
4         TA         TA          Y  unknown  unknown        Typ      NoRidge  

[5 rows x 44 columns]
In [11]:
def nullvalueremovenumericcolumns(df_nullcheck_numeric,cols):
    for columns in cols:
        df_nullcheck_numeric[columns].fillna(df_nullcheck_numeric[columns].mean(),inplace=True)
    print(df_nullcheck_numeric[cols].head())
    return df_nullcheck_numeric
In [12]:
Housedata_nullcheck_numeric=nullvalueremovenumericcolumns(final_house_data,["LotArea","YearBuilt","YearRemodAdd",
                                                                     "MasVnrArea","BsmtFinSF1","BsmtFinSF2","BsmtUnfSF",
                                                                     "TotalBsmtSF","1stFlrSF","2ndFlrSF","LowQualFinSF",
                                                                     "GrLivArea","BsmtFullBath","BsmtHalfBath","FullBath",
                                                                     "HalfBath","BedroomAbvGr","KitchenAbvGr","TotRmsAbvGrd","Fireplaces",
                                                                     "GarageCars","GarageArea","WoodDeckSF","OpenPorchSF","EnclosedPorch",
                                                                     "3SsnPorch","ScreenPorch","PoolArea","MiscVal","MoSold","YrSold"])
   LotArea  YearBuilt  YearRemodAdd  MasVnrArea  BsmtFinSF1  BsmtFinSF2  \
0     8450       2003          2003       196.0       706.0         0.0   
1     9600       1976          1976         0.0       978.0         0.0   
2    11250       2001          2002       162.0       486.0         0.0   
3     9550       1915          1970         0.0       216.0         0.0   
4    14260       2000          2000       350.0       655.0         0.0   

   BsmtUnfSF  TotalBsmtSF  1stFlrSF  2ndFlrSF  ...  GarageArea  WoodDeckSF  \
0      150.0        856.0       856       854  ...       548.0           0   
1      284.0       1262.0      1262         0  ...       460.0         298   
2      434.0        920.0       920       866  ...       608.0           0   
3      540.0        756.0       961       756  ...       642.0           0   
4      490.0       1145.0      1145      1053  ...       836.0         192   

   OpenPorchSF  EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea  MiscVal  \
0           61              0          0            0         0        0   
1            0              0          0            0         0        0   
2           42              0          0            0         0        0   
3           35            272          0            0         0        0   
4           84              0          0            0         0        0   

   MoSold  YrSold  
0       2    2008  
1       5    2007  
2       9    2008  
3       2    2006  
4      12    2008  

[5 rows x 31 columns]
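The same mean imputation can equivalently be done with scikit-learn's SimpleImputer, which also remembers the fitted means for reuse on new data. A sketch on a toy column (not part of the original notebook):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy numeric column with one missing entry
df = pd.DataFrame({"GarageArea": [400.0, np.nan, 600.0]})

# Mean imputation, equivalent to fillna(df["GarageArea"].mean())
imp = SimpleImputer(strategy="mean")
df[["GarageArea"]] = imp.fit_transform(df[["GarageArea"]])
print(df["GarageArea"].tolist())  # [400.0, 500.0, 600.0]
```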

We need to find the correlation between each feature and the target variable using the Pearson correlation matrix. Features with an absolute correlation above 0.5 are considered better candidates to feed into the model. The features are sorted by their Pearson correlation value.

In [13]:
corrs=final_house_data[final_house_data["flag"]=='0'].corr().abs()
In [14]:
s = corrs.unstack()
so = s.sort_values(kind="quicksort",ascending=False)
print(so["SalePrice"])
SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
MasVnrArea       0.475210
Fireplaces       0.466929
BsmtFinSF1       0.386420
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
KitchenAbvGr     0.135907
EnclosedPorch    0.128578
ScreenPorch      0.111447
PoolArea         0.092404
MSSubClass       0.084284
OverallCond      0.077856
MoSold           0.046432
3SsnPorch        0.044584
YrSold           0.028923
LowQualFinSF     0.025606
Id               0.021917
MiscVal          0.021190
BsmtHalfBath     0.016844
BsmtFinSF2       0.011378
dtype: float64
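The 0.5 cutoff described above can also be applied programmatically. A minimal sketch on a toy frame (in the notebook this would filter the `so["SalePrice"]` series instead):

```python
import pandas as pd

# Toy frame standing in for the training data
df = pd.DataFrame({"SalePrice": [100, 200, 300, 400],
                   "OverallQual": [4, 5, 7, 9],
                   "MoSold": [6, 1, 7, 2]})

# Absolute Pearson correlations with the target
target_corr = df.corr().abs()["SalePrice"].drop("SalePrice")

# Keep only features clearing the 0.5 threshold
selected = target_corr[target_corr > 0.5].index.tolist()
print(selected)  # ['OverallQual']
```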

EXPLORATORY DATA ANALYSIS

Now that we know which features are most significant for the model, we check their distribution against the target variable using bar plots, scatter plots (with a linear fit), and finally box plots for summary statistics.

In [15]:
plt.figure(figsize=(20,10))
sns.barplot(x='ExterQual',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672d033978>
In [16]:
plt.figure(figsize=(20,10))
sns.barplot(x='OverallQual',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[16]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672e32efd0>

It can be observed that OverallQual is the most significant column for our prediction, since it has the highest value in the correlation matrix. SalePrice increases roughly linearly as the overall quality of the house increases, which matches intuition.

In [17]:
plt.figure(figsize=(20,10))
sns.barplot(x='GarageCars',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672e7508d0>
In [18]:
plt.figure(figsize=(20,10))
sns.barplot(x='BsmtQual',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672e6bd160>
In [19]:
plt.figure(figsize=(20,10))
sns.barplot(x='KitchenQual',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672e736e80>
In [20]:
plt.figure(figsize=(20,10))
sns.barplot(x='FullBath',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672e749e48>
In [21]:
plt.figure(figsize=(20,10))
sns.barplot(x='GarageFinish',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672eaf4710>
In [22]:
plt.figure(figsize=(20,10))
sns.barplot(x='TotRmsAbvGrd',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672ee37d30>
In [23]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["YearBuilt"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="YearBuilt", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.523
P-value: 0.00000000

This shows a scatter plot with an OLS line of best fit through the data points.
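The same correlation-and-scatter cell is repeated below for each feature; a small helper would avoid the duplication. This is only a sketch (the name `corr_scatter` is hypothetical, and the px.scatter call with its OLS trendline is omitted to keep it short):

```python
import pandas as pd
from scipy.stats import pearsonr

def corr_scatter(df, feature, target="SalePrice"):
    """Print the Pearson correlation of `feature` with `target`.

    In the notebook this would also draw the plotly scatter with a
    trendline; the plotting is left out of this sketch.
    """
    r, p = pearsonr(df[target], df[feature])
    print("Pearson Correlation: %.3f" % r)
    print("P-value: %.8f" % p)
    return r, p

# Toy data to exercise the helper
toy = pd.DataFrame({"SalePrice": [1.0, 2.0, 3.0], "GrLivArea": [2.0, 4.0, 6.0]})
r, p = corr_scatter(toy, "GrLivArea")  # r is 1.0 on perfectly linear data
```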

In [24]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["GrLivArea"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="GrLivArea", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.709
P-value: 0.00000000
In [25]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["GarageArea"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="GarageArea", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.623
P-value: 0.00000000
In [26]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["TotalBsmtSF"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="TotalBsmtSF", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.614
P-value: 0.00000000
In [27]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["1stFlrSF"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="1stFlrSF", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.606
P-value: 0.00000000
In [28]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["YearRemodAdd"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="YearRemodAdd", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.507
P-value: 0.00000000
In [29]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["MasVnrArea"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="MasVnrArea", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.475
P-value: 0.00000000
In [30]:
plt.figure(figsize=(20,10))
sns.barplot(x='Fireplaces',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0x16732bbc518>
In [31]:
plt.figure(figsize=(20,10))
sns.barplot(x='FireplaceQu',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x16732bbc6a0>
In [32]:
plt.figure(figsize=(20,10))
sns.barplot(x='GarageType',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[32]:
<matplotlib.axes._subplots.AxesSubplot at 0x16732bbc438>
In [33]:
plt.figure(figsize=(20,10))
sns.barplot(x='HeatingQC',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x1673188bd30>
In [34]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["BsmtFinSF1"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="BsmtFinSF1", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.386
P-value: 0.00000000
In [35]:
plt.figure(figsize=(20,10))
sns.barplot(x='Foundation',y='SalePrice',data=Housedata_nullcheck[Housedata_nullcheck['flag']=='0'])
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x167318bc908>
In [36]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["WoodDeckSF"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="WoodDeckSF", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.324
P-value: 0.00000000
In [37]:
from scipy.stats import pearsonr 
corryu,_ =pearsonr(Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"],Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["2ndFlrSF"])
colorassigned=Housedata_nullcheck[Housedata_nullcheck['flag']=='0']["SalePrice"]
fig = px.scatter(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="2ndFlrSF", y="SalePrice",trendline="ols",color=colorassigned)
fig.show()
print("Pearson Correlation: %.3f" % corryu)
print("P-value: %.8f" % _)
Pearson Correlation: 0.319
P-value: 0.00000000
In [38]:
fig = px.pie(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], values='HalfBath', names='HalfBath')
fig.show()
In [39]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="ExterQual", 
             y="SalePrice", points="all",color="ExterQual",
             title="Distribution of SalePrice with External Quality of House",
            )
fig.show()
In [40]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="GarageCars", 
             y="SalePrice", points="all",color="GarageCars",
             title="Distribution of SalePrice with number of Car Garages in House",
            )
fig.show()
In [41]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="OverallQual", 
             y="SalePrice", points="all",color="OverallQual",
             title="Distribution of SalePrice with Overall present Quality of House",
            )
fig.show()
In [42]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="BsmtQual", 
             y="SalePrice", points="all",color="BsmtQual",
             title="Distribution of SalePrice with the quality of the basement in the House",
            )
fig.show()
In [43]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="KitchenQual", 
             y="SalePrice", points="all",color="KitchenQual",
             title="Distribution of SalePrice with the quality of the Kitchen in the House",
            )
fig.show()
In [44]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="FullBath", 
             y="SalePrice", points="all",color="FullBath",
             title="Distribution of SalePrice with the number of full bathrooms in the House",
            )
fig.show()
In [45]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="GarageFinish", 
             y="SalePrice", points="all",color="GarageFinish",
             title="Distribution of SalePrice with the status of Garage in the House",
            )
fig.show()
In [46]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="TotRmsAbvGrd", 
             y="SalePrice", points="all",color="TotRmsAbvGrd",
             title="Distribution of SalePrice with the total number of rooms above the ground in the House",
            )
fig.show()
In [47]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="Fireplaces", 
             y="SalePrice", points="all",color="Fireplaces",
             title="Distribution of SalePrice with the number of fireplaces present in the House",
            )
fig.show()
      
     
In [48]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="FireplaceQu", 
             y="SalePrice", points="all",color="FireplaceQu",
             title="Distribution of SalePrice with the quality of fireplaces present in the House",
            )
fig.show()
In [49]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="GarageType", 
             y="SalePrice", points="all",color="GarageType",
             title="Distribution of SalePrice with the type of garage present in the House",
            )
fig.show()
In [50]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="HeatingQC", 
             y="SalePrice", points="all",color="HeatingQC",
             title="Distribution of SalePrice with the quality of Heating in the House",
            )
fig.show()
In [51]:
fig = px.box(Housedata_nullcheck[Housedata_nullcheck['flag']=='0'], x="Foundation", 
             y="SalePrice", points="all",color="Foundation",
             title="Distribution of SalePrice with the type of material used for constructing the House",
            )
fig.show()

Now we need to encode the categorical columns so that we can feed them to the model for prediction. A LabelEncoder is applied to all the columns at once.

In [52]:
from sklearn.preprocessing import LabelEncoder

def labelencoding(df,cols):
    for columns in cols:
        le = LabelEncoder()
        df[columns] = le.fit_transform(df[columns].values)
    print(df[cols].head())
    return df
    
In [53]:
Housedata_encoded=labelencoding(final_house_data,["MSZoning","MiscFeature","SaleType","SaleCondition","MSSubClass","Street","Alley","LotShape","LandContour","Utilities","LotConfig","LandSlope","Condition1"
                    ,"BldgType","Condition2","HouseStyle","RoofStyle","RoofMatl","Exterior1st","Exterior2nd","MasVnrType","ExterQual","ExterCond","Foundation"
                    ,"BsmtQual","BsmtCond","BsmtExposure","BsmtFinType1","BsmtFinType2","Heating","HeatingQC","CentralAir",
                    "Electrical","KitchenQual","FireplaceQu","GarageType","GarageFinish","GarageQual","GarageCond","PavedDrive",
                    "PoolQC","Fence","Functional","Neighborhood"])
   MSZoning  MiscFeature  SaleType  SaleCondition  MSSubClass  Street  Alley  \
0         3            4         8              4           5       1      2   
1         3            4         8              4           0       1      2   
2         3            4         8              4           5       1      2   
3         3            4         8              0           6       1      2   
4         3            4         8              4           5       1      2   

   LotShape  LandContour  Utilities  ...  FireplaceQu  GarageType  \
0         3            3          0  ...            5           1   
1         3            3          0  ...            4           1   
2         0            3          0  ...            4           1   
3         0            3          0  ...            2           5   
4         0            3          0  ...            4           1   

   GarageFinish  GarageQual  GarageCond  PavedDrive  PoolQC  Fence  \
0             1           4           4           2       3      4   
1             1           4           4           2       3      4   
2             1           4           4           2       3      4   
3             2           4           4           2       3      4   
4             1           4           4           2       3      4   

   Functional  Neighborhood  
0           6             5  
1           6            24  
2           6             5  
3           6             6  
4           6            15  

[5 rows x 44 columns]
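One caveat with LabelEncoder: it assigns integer codes in alphabetical order, which imposes an artificial ordering on the nominal columns. One-hot encoding with pd.get_dummies is a common alternative for those; a sketch of what that would look like (not what this notebook does):

```python
import pandas as pd

# Toy nominal column; a LabelEncoder would code FV=0, RL=1, RM=2,
# an alphabetical order the categories do not actually have.
df = pd.DataFrame({"MSZoning": ["RL", "RM", "FV", "RL"]})

# One-hot encoding avoids the artificial order entirely
encoded = pd.get_dummies(df, columns=["MSZoning"])
print(sorted(encoded.columns))  # ['MSZoning_FV', 'MSZoning_RL', 'MSZoning_RM']
```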
In [54]:
# Checking for null values: none remain after the cleaning. Only SalePrice has nulls,
# since it is the target variable and is absent from the test rows.
ax=plt.figure(figsize=(20,10))
sns.heatmap(Housedata_nullcheck.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x1672d033160>
In [55]:
ax=plt.figure(figsize=(20,10))
sns.heatmap(Housedata_nullcheck_numeric.isnull(),yticklabels=False,cbar=False,cmap='viridis')
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x16732d2ddd8>
In [56]:
ax=plt.figure(figsize=(20,20))
sns.heatmap(final_house_data[final_house_data["flag"]=='0'].corr(),cmap='coolwarm')
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x16732d42080>
In [57]:
# Finding the correlation of all the columns with the target variable, after encoding the categorical columns.
corrs=final_house_data[final_house_data["flag"]=='0'].corr().abs()
In [58]:
s = corrs.unstack()
so = s.sort_values(kind="quicksort",ascending=False)
print(so["SalePrice"])
SalePrice       1.000000
OverallQual     0.790982
GrLivArea       0.708624
GarageCars      0.640409
ExterQual       0.636884
GarageArea      0.623431
BsmtQual        0.620886
TotalBsmtSF     0.613581
1stFlrSF        0.605852
KitchenQual     0.589189
FullBath        0.560664
GarageFinish    0.549247
TotRmsAbvGrd    0.533723
YearBuilt       0.522897
YearRemodAdd    0.507101
MasVnrArea      0.475210
Fireplaces      0.466929
FireplaceQu     0.459605
GarageType      0.415283
HeatingQC       0.400178
BsmtFinSF1      0.386420
Foundation      0.382479
WoodDeckSF      0.324413
2ndFlrSF        0.319334
OpenPorchSF     0.315856
BsmtExposure    0.309043
HalfBath        0.284108
LotArea         0.263843
LotShape        0.255580
CentralAir      0.251328
                  ...   
Exterior2nd     0.103766
Exterior1st     0.103551
BsmtFinType1    0.103114
Heating         0.098812
PoolArea        0.092404
Condition1      0.091155
BldgType        0.085591
OverallCond     0.077856
MiscFeature     0.073609
LotConfig       0.067396
MSSubClass      0.062834
SaleType        0.054911
LandSlope       0.051152
MoSold          0.046432
3SsnPorch       0.044584
Street          0.041036
MasVnrType      0.029658
YrSold          0.028923
LowQualFinSF    0.025606
GarageCond      0.025149
Id              0.021917
MiscVal         0.021190
BsmtHalfBath    0.016844
LandContour     0.015453
BsmtCond        0.015058
Utilities       0.014314
BsmtFinSF2      0.011378
BsmtFinType2    0.008041
Condition2      0.007513
GarageQual      0.006861
Length: 79, dtype: float64
In [59]:
# Plotting the value counts of a few columns
colorassigned=Housedata_encoded["OverallQual"]
fig = px.histogram(final_house_data, x="OverallQual", marginal="rug",
                   hover_data=final_house_data.columns,nbins=30,color=colorassigned)
fig.show()

It can be observed that most houses have a medium quality rating of 5. Only 31 houses have a rating of 10, i.e. outstanding.

In [60]:
colorassigned=Housedata_encoded["GarageCars"]
fig = px.histogram(final_house_data, x="GarageCars", marginal="rug",
                   hover_data=final_house_data.columns,nbins=20,color=colorassigned)
fig.show()
In [61]:
colorassigned=Housedata_encoded["ExterQual"]
fig = px.histogram(final_house_data, x="ExterQual", marginal="rug",
                   hover_data=final_house_data.columns,nbins=30,color=colorassigned)
fig.show()

Checking the kurtosis to assess normality (skewness could be inspected the same way with .skew()). A few columns have very high kurtosis, and those columns are scaled below using MinMaxScaler.

In [62]:
final_house_data.kurtosis(axis=0) 
Out[62]:
1stFlrSF            6.956479
2ndFlrSF           -0.422261
3SsnPorch         149.409834
Alley              13.908580
BedroomAbvGr        1.941404
BldgType            3.198456
BsmtCond            8.779150
BsmtExposure       -0.234520
BsmtFinSF1          6.908223
BsmtFinSF2         18.844012
BsmtFinType1       -1.337228
BsmtFinType2       10.576461
BsmtFullBath       -0.734139
BsmtHalfBath       14.860300
BsmtQual            0.950486
BsmtUnfSF           0.404783
CentralAir          9.983985
Condition1         15.708108
Condition2        308.542721
Electrical          7.648715
EnclosedPorch      28.377909
ExterCond           5.093682
ExterQual           3.697902
Exterior1st        -0.307937
Exterior2nd        -0.557545
Fence               2.734273
FireplaceQu        -0.765985
Fireplaces          0.076424
Foundation          0.756532
FullBath           -0.538129
                    ...     
LowQualFinSF      174.932812
MSSubClass         -0.473143
MSZoning            5.857871
MasVnrArea          9.351548
MasVnrType          0.550083
MiscFeature        32.267569
MiscVal           564.074582
MoSold             -0.454337
Neighborhood       -1.063214
OpenPorchSF        10.937353
OverallCond         1.479447
OverallQual         0.067219
PavedDrive          7.125856
PoolArea          298.633144
PoolQC            452.032360
RoofMatl           76.860298
RoofStyle           0.875131
SaleCondition       7.229055
SalePrice           6.536282
SaleType           13.683977
ScreenPorch        17.776704
Street            238.664799
TotRmsAbvGrd        1.169064
TotalBsmtSF         9.155258
Utilities        1186.322168
WoodDeckSF          6.741550
YearBuilt          -0.511317
YearRemodAdd       -1.346431
YrSold             -1.155147
flag               -2.001371
Length: 80, dtype: float64
In [63]:
# Scaling the columns that have very high kurtosis.
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
final_house_data[['3SsnPorch','Condition2', 'PoolArea','Utilities','MiscVal','Heating','LotArea','LowQualFinSF','Street','RoofMatl','MiscFeature','EnclosedPorch']] = mms.fit_transform(final_house_data[['3SsnPorch','Condition2', 'PoolArea','Utilities','MiscVal','Heating','LotArea','LowQualFinSF','Street','RoofMatl','MiscFeature','EnclosedPorch']])
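MinMaxScaler maps each column to [0, 1] via (x − min) / (max − min). One caveat worth flagging: this is an affine transform, so it leaves kurtosis and skewness unchanged; a log transform such as np.log1p is what actually tames heavy tails. A small check on toy data:

```python
import numpy as np
import pandas as pd

# Heavy-tailed toy column
s = pd.Series([0.0, 1.0, 2.0, 3.0, 100.0])

# Min-max scaling by hand: (x - min) / (max - min)
scaled = (s - s.min()) / (s.max() - s.min())

# Kurtosis is invariant under affine transforms...
print(abs(s.kurtosis() - scaled.kurtosis()) < 1e-9)  # True: shape unchanged

# ...whereas log1p genuinely reduces it
print(np.log1p(s).kurtosis() < s.kurtosis())  # True: tail reduced
```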

MODEL CREATION AND MAKING PREDICTIONS

Predicting with a simple linear regression model

In [64]:
# Selecting the training part of the dataset (rows with flag == '0')
p=final_house_data[final_house_data["flag"]=='0']
In [65]:
# Removing the flag, SalePrice and Id columns from the feature matrix
colss = [col for col in p.columns if col not in ['flag','SalePrice','Id']]
In [66]:
X=p[colss]
In [67]:
X.head()
Out[67]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 856 854 0.0 2 3 0 3 3 706.0 0.0 ... 8 0 1.0 8 856.0 0.0 0 2003 2003 2008
1 1262 0 0.0 2 3 0 3 1 978.0 0.0 ... 8 0 1.0 6 1262.0 0.0 298 1976 1976 2007
2 920 866 0.0 2 3 0 3 2 486.0 0.0 ... 8 0 1.0 6 920.0 0.0 0 2001 2002 2008
3 961 756 0.0 2 3 0 1 3 216.0 0.0 ... 8 0 1.0 7 756.0 0.0 0 1915 1970 2006
4 1145 1053 0.0 2 4 0 3 0 655.0 0.0 ... 8 0 1.0 9 1145.0 0.0 192 2000 2000 2008

5 rows × 77 columns

In [68]:
y=p["SalePrice"]
In [69]:
# Importing the train/test split utility
from sklearn.model_selection import train_test_split
In [70]:
# Splitting the data into training and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [71]:
# importing linear regressor
from sklearn.linear_model import LinearRegression
In [72]:
# Instantiating linear regressor
lm=LinearRegression()
In [73]:
# Fitting the model on the training data
lm.fit(X_train,y_train)
Out[73]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [74]:
# Predicting on the held-out validation split
predictions_linearregressor_traindata=lm.predict(X_test)
dff = pd.DataFrame({'Actual': y_test, 'Predicted': predictions_linearregressor_traindata})
dff
Out[74]:
Actual Predicted
892 154500.0 147030.523841
1105 325000.0 326819.384941
413 115000.0 110563.170192
522 159000.0 178248.299368
1036 315500.0 319814.911865
614 75500.0 73371.963526
218 311500.0 231689.407379
1160 146000.0 144561.754408
649 84500.0 72783.536977
887 135500.0 153535.998112
576 145000.0 151687.187392
1252 130000.0 105478.037013
1061 81000.0 84508.292833
567 214000.0 208881.962494
1108 181000.0 165718.609535
1113 134500.0 134619.988010
168 183500.0 209067.543469
1102 135000.0 115970.723911
1120 118400.0 112784.975204
67 226000.0 231583.867472
1040 155000.0 142190.865704
453 210000.0 207733.225604
670 173500.0 185753.962412
1094 129000.0 121026.954378
192 192000.0 214083.828413
123 153900.0 156854.974000
415 181134.0 210820.972876
277 141000.0 91043.440238
433 181000.0 167895.483819
1317 208900.0 192036.757399
... ... ...
621 240000.0 215961.701584
1157 230000.0 207280.544270
1322 190000.0 244466.744171
704 213000.0 226082.252202
1323 82500.0 105887.409228
199 274900.0 321227.299652
493 155000.0 145304.541291
664 423000.0 326072.849697
1339 128500.0 113675.251657
1058 335000.0 366048.424123
1187 262000.0 246446.757515
10 129500.0 126691.711784
147 222500.0 226152.183437
764 270000.0 258183.530155
282 207500.0 205704.396216
298 175000.0 193940.306767
918 238000.0 276075.304495
291 135900.0 98054.996779
819 224000.0 216609.124366
573 170000.0 180518.672965
1454 185000.0 204091.535407
549 263000.0 229363.412216
462 62383.0 107811.478375
129 150000.0 148029.946591
845 171000.0 167829.376289
331 139000.0 117616.305442
323 126175.0 113078.370712
650 205950.0 223890.031989
439 110000.0 128342.942757
798 485000.0 407872.149974

438 rows × 2 columns

In [75]:
# Checking the mean squared log error on the validation split
from sklearn import metrics
metrics.mean_squared_log_error(y_test, predictions_linearregressor_traindata)
Out[75]:
0.07525545458722958
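Mean squared log error penalizes relative rather than absolute differences, which suits a target spanning a wide price range. A minimal sketch of the formula behind `metrics.mean_squared_log_error`, with made-up prices:

```python
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([150000.0, 300000.0, 90000.0])
y_pred = np.array([140000.0, 320000.0, 100000.0])

# MSLE = mean((log(1 + y_true) - log(1 + y_pred))^2)
msle_manual = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
msle_sklearn = mean_squared_log_error(y_true, y_pred)
print(msle_manual, msle_sklearn)
```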
In [76]:
# Checking the R^2 score of the linear regressor on the validation split
linearregressionscore=lm.score(X_test,y_test)
linearregressionscore
Out[76]:
0.8457065008586933
In [77]:
# Selecting the test part of the data
Lineartestdata=final_house_data[final_house_data["flag"]=='1']
In [78]:
Lineartestdata.head()
Out[78]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold flag
0 896 0 0.0 2 2 0 3 3 468.0 144.0 ... 120 1.0 5 882.0 0.0 140 1961 1961 2010 1
1 1329 0 0.0 2 3 0 3 3 923.0 0.0 ... 0 1.0 6 1329.0 0.0 393 1958 1958 2010 1
2 928 701 0.0 2 3 0 3 3 791.0 0.0 ... 0 1.0 6 928.0 0.0 212 1997 1998 2010 1
3 926 678 0.0 2 3 0 3 3 602.0 0.0 ... 0 1.0 7 926.0 0.0 360 1998 1998 2010 1
4 1280 0 0.0 2 2 4 3 3 263.0 0.0 ... 144 1.0 5 1280.0 0.0 0 1992 1992 2010 1

5 rows × 80 columns

In [79]:
# Dropping flag, Id, and the SalePrice target, which must not appear in the feature matrix.
testdata = Lineartestdata.drop(['flag','SalePrice','Id'], axis=1)
In [80]:
testdata.head()
Out[80]:
1stFlrSF 2ndFlrSF 3SsnPorch Alley BedroomAbvGr BldgType BsmtCond BsmtExposure BsmtFinSF1 BsmtFinSF2 ... SaleType ScreenPorch Street TotRmsAbvGrd TotalBsmtSF Utilities WoodDeckSF YearBuilt YearRemodAdd YrSold
0 896 0 0.0 2 2 0 3 3 468.0 144.0 ... 8 120 1.0 5 882.0 0.0 140 1961 1961 2010
1 1329 0 0.0 2 3 0 3 3 923.0 0.0 ... 8 0 1.0 6 1329.0 0.0 393 1958 1958 2010
2 928 701 0.0 2 3 0 3 3 791.0 0.0 ... 8 0 1.0 6 928.0 0.0 212 1997 1998 2010
3 926 678 0.0 2 3 0 3 3 602.0 0.0 ... 8 0 1.0 7 926.0 0.0 360 1998 1998 2010
4 1280 0 0.0 2 2 4 3 3 263.0 0.0 ... 8 144 1.0 5 1280.0 0.0 0 1992 1992 2010

5 rows × 77 columns

In [81]:
# making predictions on the test data set
predictions_linearregressor_testdata=lm.predict(testdata)
In [82]:
predictions_linearregressor_testdata
Out[82]:
array([108440.52310196, 172140.67250594, 164479.75131626, ...,
       162993.21056134, 111775.93801525, 251065.30281754])

Predicting using the DecisionTree Regressor

In [83]:
# Importing the required libraries
from sklearn.model_selection import train_test_split
In [84]:
# Splitting the train and the test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
In [85]:
# Importing the decision tree regressor
from sklearn.tree import DecisionTreeRegressor
In [86]:
# Instantiating the DecisionTree regressor
decisiontreereg=DecisionTreeRegressor()
In [87]:
# Fitting on the training data set
decisiontreereg.fit(X_train,y_train)
Out[87]:
DecisionTreeRegressor(criterion='mse', max_depth=None, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=None, splitter='best')
In [88]:
# Predicting on the held-out validation split
predictions_decisiontree_traindata=decisiontreereg.predict(X_test)
predictions_decisiontree_traindata
Out[88]:
array([154000., 302000., 109900., 168500., 325000.,  81000., 201800.,
       148500., 100000., 122500., 189950., 135000.,  90000., 174000.,
       179000., 109000., 190000., 118500., 113000., 191000., 143900.,
       227000., 173900., 112500., 179900., 186000., 164990.,  82000.,
       183200., 196000., 134900., 214000., 259500., 110000., 250000.,
       154000., 127000., 190000., 235000., 106500., 109500., 226000.,
       119900., 378500., 136500.,  87000., 114500., 135000., 412500.,
       137000., 120000., 175000.,  83000., 235000., 131500., 215000.,
       226700., 152000., 143250.,  93000.,  80000., 139000., 313000.,
       227000., 377500., 259500., 108000., 281213., 128500., 155000.,
       118964., 140000., 110000., 109900., 466500., 181000., 237000.,
       328000., 139000., 144000.,  94500.,  81000., 128900., 106000.,
       159000., 214500., 250000., 192000., 131500., 200141., 200000.,
       139000., 144000., 328000., 124000., 181000., 122000., 155000.,
       279500., 268000., 175000., 208500., 320000., 116900., 181900.,
       149900., 165500., 278000., 138000., 210000.,  55000., 133700.,
       154000., 138887., 239799., 121000., 110000., 110000., 136905.,
       257000., 190000., 131500., 175000., 177000., 121500., 144000.,
       233000., 121000., 150500., 185000., 180000., 315000., 188000.,
       110000., 113000., 320000., 306000., 116000., 224900., 625000.,
       278000., 118000., 187000., 168000., 129000., 100000., 230000.,
       180500., 158000.,  58500., 102000., 156000., 257500., 136900.,
        80000., 136905.,  98600., 157000.,  81000., 142000., 377500.,
       143250., 270000., 133000., 108000., 119000., 184000., 437154.,
       437154., 200000., 326000., 100000., 115000., 140000., 325000.,
       139000., 132500., 241000., 126000., 159434., 158000.,  73000.,
       134800., 178000., 237000., 161000., 324000., 234000., 202665.,
       110000., 122000., 110500., 108000., 155000., 181000., 159434.,
       189000., 100000., 161000., 126000., 302000., 190000., 143750.,
       325000., 179900., 151000., 337000., 132500., 168000., 106500.,
       235000., 122900.,  82000., 145900., 190000., 257000., 172500.,
       129000.,  60000., 124000., 151000., 202500., 272000., 119000.,
       233230., 143000., 112000., 100000., 145900., 105000., 100000.,
       179200., 144000., 143000., 212000., 214500., 172500., 137500.,
       215000., 100000., 110000., 268000., 239000., 361919., 187000.,
       125000., 159000., 148000., 135000., 110000., 181000., 167240.,
       145900., 110000., 149000., 134000.,  87000., 110000., 140000.,
       235000., 250000., 167900., 110000., 204000., 248328., 207500.,
       165000., 140000., 139950., 190000., 325624., 202500., 232000.,
        83000., 102000., 138000., 130000., 372500., 177000., 188700.,
       203000., 109500., 186500., 130000., 295493., 175000., 237000.,
       106500., 248328., 187500., 139500., 110000., 150750., 168000.,
       121000., 137450., 139500., 156000., 163990., 128000., 187500.,
       232000., 127500., 155000., 176500., 219500., 152000., 207500.,
       259500., 143750., 161750., 228950.,  87000., 136900., 137450.,
       173000., 179000., 171900., 302000., 109900., 232000., 158000.,
       106500.,  91000., 127500., 139600., 146000., 193879., 190000.,
        90000., 179000., 162900., 117000., 179000., 171900., 141000.,
       128000., 132000.,  76000., 222000., 179000., 135000., 140000.,
       145250., 225000., 269500., 312500., 110000., 129000.,  97000.,
       299800., 303477., 266000., 179200., 256300., 127000., 140000.,
        91500., 226700., 301500., 187000.,  99500., 270000., 241000.,
       132500., 181000., 186000., 105000., 145250., 153000., 120000.,
       179000., 147000., 220000., 147500., 119000., 239000., 175000.,
        89500., 192000., 155000., 190000., 201000., 136500., 142600.,
       228000., 167500., 110000., 260000., 190000., 102000., 162900.,
       134000., 147000., 143000., 361919., 116000., 225000., 186000.,
       142600., 163990., 340000., 118500., 155000., 119200.,  88000.,
       202500., 118000., 192000., 250580., 212000., 210000.,  52000.,
       307000., 132000., 320000., 106500., 361919., 324000., 135000.,
       226000., 236500., 250580., 148000., 228000., 200000., 250580.,
       168500., 190000., 226000., 106500., 165500., 228950., 142000.,
       154000., 214900., 115000., 437154.])
In [89]:
# Score for the decision tree regressor on the validation split
decisiontreescore=decisiontreereg.score(X_test,y_test)
decisiontreescore
Out[89]:
0.7953109996374214
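An unconstrained decision tree memorizes the training data and generalizes worse, which is consistent with its lower score here than the ensembles below. A hedged sketch of this trade-off on synthetic data (the dataset and depth limits are illustrative assumptions, not the notebook's):

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=10, noise=20.0, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# An unconstrained tree fits every training row exactly (training R^2 = 1.0);
# limiting depth and leaf size trades training fit for generalization.
full = DecisionTreeRegressor(random_state=0).fit(Xtr, ytr)
pruned = DecisionTreeRegressor(max_depth=5, min_samples_leaf=5, random_state=0).fit(Xtr, ytr)

print(full.score(Xtr, ytr), full.score(Xva, yva))
print(pruned.score(Xtr, ytr), pruned.score(Xva, yva))
```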
In [90]:
# Getting the decisiontree test data set using the flag filtering
Decisiontreetestdata=final_house_data[final_house_data["flag"]=='1']
In [91]:
# Removing the unwanted columns
testdata_decisiontree = Decisiontreetestdata.drop(['flag','SalePrice','Id'], axis=1)
In [92]:
# Making predictions on the test data set
predictions_decisiontree_testdata=decisiontreereg.predict(testdata_decisiontree)
In [93]:
predictions_decisiontree_testdata
Out[93]:
array([123000., 163000., 223500., ..., 143000., 118500., 239799.])

Predicting using the RandomForest Regressor

In [94]:
# Getting the train data using the flag variable.
q=final_house_data[final_house_data["flag"]=='0']
In [95]:
# Filtering the columns 
cold = [col for col in q.columns if col not in ['flag','SalePrice','Id']]
In [96]:
Z=q[cold]
In [97]:
t=q['SalePrice']
In [98]:
# Importing the train/test split utility
from sklearn.model_selection import train_test_split
In [99]:
# Splitting the training and the test set within the train data set
Z_train, Z_test, t_train, t_test = train_test_split(Z, t, test_size=0.3, random_state=42)
In [100]:
# Importing the randomforest regressor
from sklearn.ensemble import RandomForestRegressor
In [101]:
# Instantiating the regressor and passing the required parameters to the regressor
Randomforestregr=RandomForestRegressor(n_estimators = 100,n_jobs = -1,oob_score = True, bootstrap = True,random_state=42)
In [102]:
# Fitting to the training data set
Randomforestregr.fit(Z_train,t_train)
Out[102]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
                      oob_score=True, random_state=42, verbose=0,
                      warm_start=False)
In [103]:
# Predicting on the held-out validation split
prediction_randomforest_traindata=Randomforestregr.predict(Z_test)
In [104]:
# R^2 score on the validation split
randomforestscore=Randomforestregr.score(Z_test,t_test)
randomforestscore
Out[104]:
0.9004845192655009
In [105]:
print('R^2 Training Score: {:.2f} \nOOB Score: {:.2f} \nR^2 Validation Score: {:.2f}'.format(Randomforestregr.score(Z_train, t_train), 
                                                                                             Randomforestregr.oob_score_,
                                                                                             Randomforestregr.score(Z_test, t_test)))
R^2 Training Score: 0.98 
OOB Score: 0.83 
R^2 Validation Score: 0.90
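The OOB (out-of-bag) score is a built-in validation estimate: each tree is fit on a bootstrap sample, and the rows a tree never saw are used to score it, so no separate holdout is needed. A minimal sketch on synthetic data (stand-in dataset, not the house data):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=400, n_features=8, noise=10.0, random_state=0)

# oob_score=True scores each row using only the trees that did not see it.
rf = RandomForestRegressor(n_estimators=100, oob_score=True, bootstrap=True, random_state=0)
rf.fit(X_demo, y_demo)

# oob_score_ is an R^2 estimate computed without a separate holdout split.
print(rf.oob_score_)
```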
In [106]:
# Getting the test dataset to make the final predictions
Randomforesttestdata=final_house_data[final_house_data["flag"]=='1']
In [107]:
# removing the unwanted columns
testdata_randomforest = Randomforesttestdata.drop(['flag','SalePrice','Id'], axis=1)
In [108]:
# Making the predictions on the test dataset.
predictions_randomforest_testdata=Randomforestregr.predict(testdata_randomforest)
predictions_randomforest_testdata
Out[108]:
array([124300.25, 149270.  , 184263.  , ..., 154153.  , 119840.  ,
       226828.46])
In [109]:
# Plotting feature importances: the contribution of each feature to predicting SalePrice.
feature_imp=pd.DataFrame(sorted(zip(Randomforestregr.feature_importances_,Z)),columns=["Significance","Features"])
fig=plt.figure(figsize=(20,20))
sns.barplot(x="Significance",y="Features",data=feature_imp.sort_values(by="Significance",ascending=False),dodge=False)
plt.title("Important features for predicting the SalePrice of the House")
plt.tight_layout()
plt.show()
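The same importances can also be inspected as text rather than a plot. A minimal sketch (synthetic data and hypothetical column names, since the fitted model above is not reproduced here):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=5, n_informative=2, random_state=0)
cols = [f"f{i}" for i in range(5)]

rf = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_demo, y_demo)

# Rank features by their share of impurity reduction across the forest;
# the importances are normalized to sum to 1.
imp = pd.Series(rf.feature_importances_, index=cols).sort_values(ascending=False)
print(imp.head(3))
```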

Predicting using the XGBoost Regressor with Hyperparameter Tuning to give the best predictions

In [110]:
# Getting the train data from the dataset 
r=final_house_data[final_house_data["flag"]=='0']
In [111]:
# Getting the required columns
colsxg=[col for col in r.columns if col not in ['flag','SalePrice','Id']]
In [112]:
A=r[colsxg]
In [113]:
b=r['SalePrice']
In [114]:
# Splitting the train and test set within the train set
A_train, A_test, b_train, b_test = train_test_split(A, b, test_size=0.3, random_state=42)
In [115]:
# Instantiating the XGBoost model with tuned hyperparameters obtained via GridSearchCV.
XgBoostmodel = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    reg_alpha=0.002,
    gamma=0.3,
    max_depth=4,
    min_child_weight=4,
    subsample=1,
    colsample_bytree=1,
)
In [116]:
# Fitting on the train sample.
XgBoostmodel.fit(A_train,b_train)
Out[116]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=1, gamma=0.3, gpu_id=-1,
             importance_type='gain', interaction_constraints='',
             learning_rate=0.300000012, max_delta_step=0, max_depth=4,
             min_child_weight=4, missing=nan, monotone_constraints='()',
             n_estimators=100, n_jobs=0, num_parallel_tree=1,
             objective='reg:squarederror', random_state=0, reg_alpha=0.002,
             reg_lambda=1, scale_pos_weight=1, subsample=1, tree_method='exact',
             validate_parameters=1, verbosity=None)
In [117]:
# Predicting on the held-out validation split
Xgboost_prediction_train = XgBoostmodel.predict(A_test)
In [118]:
Xgboost_prediction_train
Out[118]:
array([141159.02 , 310660.3  , 126889.43 , 166978.34 , 357425.25 ,
        71755.484, 208392.66 , 145822.89 ,  77609.61 , 133613.66 ,
       156965.72 , 122183.555,  92820.78 , 194781.9  , 166008.16 ,
       142383.83 , 203335.38 , 136308.3  , 128102.82 , 228394.77 ,
       153302.97 , 234158.53 , 174569.25 , 137086.27 , 206710.4  ,
       171166.45 , 202451.55 ,  99426.42 , 179222.72 , 192666.62 ,
       126867.85 , 254761.1  , 193024.14 , 114267.734, 247484.97 ,
       158078.72 , 106125.71 , 202512.88 , 311866.22 , 116836.55 ,
       135183.03 , 235593.5  , 116465.88 , 369765.9  , 121184.32 ,
       124172.09 , 113525.5  , 126820.69 , 365006.6  , 131667.92 ,
       122824.41 , 178968.86 , 126321.98 , 338806.9  , 144974.55 ,
       271744.8  , 189598.9  , 153722.94 , 139299.77 , 119655.21 ,
        65607.52 , 169987.39 , 303576.1  , 304228.7  , 279040.47 ,
       225250.11 , 107081.33 , 358278.7  , 112612.555, 171456.14 ,
       118496.44 , 128660.6  , 111973.83 ,  82967.35 , 514432.8  ,
       168879.5  , 305786.66 , 287880.62 , 139515.84 , 122758.44 ,
       107090.05 ,  80707.85 , 113495.414,  91941.695, 152760.27 ,
       135488.03 , 272639.7  , 211070.8  , 143958.42 , 183080.88 ,
       145142.45 , 148277.69 , 125821.89 , 259270.92 , 106139.78 ,
       169591.38 , 186086.58 , 175584.11 , 206793.1  , 229678.67 ,
       169036.62 , 212937.86 , 277127.47 , 129586.88 , 175499.28 ,
       161609.81 , 158029.97 , 266113.75 , 138769.77 , 182429.66 ,
        66488.484, 119061.875, 136062.62 , 133576.2  , 194239.16 ,
       112220.32 , 106982.875, 104201.73 , 133107.7  , 253317.23 ,
       137798.11 , 142322.67 , 178390.45 , 198734.84 , 175740.28 ,
       128316.66 , 231172.03 , 116618.664, 140852.48 , 182238.88 ,
       182411.25 , 387298.56 , 202785.67 , 122766.8  ,  83989.266,
       335514.7  , 410405.25 , 128984.05 , 227064.56 , 635887.9  ,
       404424.53 , 125185.63 , 178361.66 , 143571.9  , 127628.75 ,
       128033.74 , 250621.52 , 188852.1  , 130916.02 ,  56318.492,
       108102.96 , 146244.64 , 247767.61 , 156185.33 ,  77679.51 ,
       121581.85 , 151154.62 , 154299.88 ,  88482.69 , 129703.53 ,
       195168.64 , 164840.67 , 327965.9  , 143662.83 , 118033.15 ,
        79548.34 , 222575.98 , 351997.5  , 525527.25 , 243750.8  ,
       383460.44 ,  84262.49 , 108987.75 , 175964.19 , 323791.28 ,
       141601.86 , 122771.76 , 222199.56 , 121177.33 , 161126.83 ,
       166230.81 ,  91733.31 , 134503.98 , 138764.81 , 281641.16 ,
       140867.45 , 271710.2  , 217880.62 , 193802.47 ,  76527.   ,
       114027.445, 112938.43 , 126164.086, 162841.02 , 174636.89 ,
       179553.7  , 180237.34 ,  92741.414, 176305.34 , 123949.625,
       224322.61 , 192573.38 , 118852.39 , 328405.62 , 191953.73 ,
       127419.984, 247381.27 , 135176.56 , 147378.22 , 117078.195,
       217220.64 , 138726.78 , 101399.12 , 170683.33 , 219061.55 ,
       276151.06 , 180452.08 , 156173.58 , 106385.25 , 128905.89 ,
       144431.9  , 223348.56 , 209391.42 ,  92912.07 , 245535.34 ,
       137491.33 , 102177.445,  94624.5  , 157458.34 , 114097.17 ,
       102860.15 , 183624.25 , 126167.48 , 139284.98 , 214382.33 ,
       166014.08 , 206296.55 , 151947.14 , 266525.1  , 113933.25 ,
       114464.92 , 240055.22 , 212086.78 , 437636.47 , 194083.88 ,
       119832.805, 159604.94 , 189727.97 , 144711.81 , 104160.23 ,
       174736.72 , 176642.92 , 152971.62 ,  91455.32 , 145662.28 ,
       149392.03 , 109375.484, 120421.1  , 180356.4  , 289203.25 ,
       259988.92 , 169351.5  , 128464.836, 221701.23 , 310320.6  ,
       206640.72 , 187407.66 , 139050.28 , 110757.56 , 177248.16 ,
       433229.47 , 221295.66 , 241890.58 ,  98119.32 , 100386.234,
       131133.84 , 134286.05 , 291270.34 , 236491.88 , 128725.41 ,
       204630.34 ,  97804.4  , 192077.1  , 109247.67 , 320800.16 ,
       169639.42 , 212159.88 ,  93998.53 , 257965.55 , 185027.16 ,
       120446.66 , 117130.88 , 149264.12 , 167592.84 ,  90943.65 ,
       130997.836, 142362.66 , 138815.94 , 179927.73 , 115193.05 ,
       175372.06 , 235930.2  , 117954.21 , 161101.38 , 166658.97 ,
       202545.39 , 151445.52 , 214637.88 , 195009.42 , 128802.914,
       171865.45 , 168379.39 ,  73014.445, 162601.58 , 141818.64 ,
       177781.47 , 186327.89 , 176246.75 , 271475.6  ,  89655.28 ,
       214779.95 , 136648.5  , 133218.7  ,  79609.664, 173798.19 ,
       154524.02 , 129265.03 , 196166.4  , 168230.97 ,  96254.96 ,
       148630.58 , 168047.55 , 128672.945, 187305.52 , 149113.67 ,
       127305.59 , 130718.09 , 118947.75 ,  78517.24 , 224089.39 ,
       170266.27 , 139294.89 , 124304.5  , 153397.62 , 212849.66 ,
       340537.03 , 349967.94 , 111838.16 , 217692.9  , 128741.32 ,
       297214.44 , 385006.03 , 305494.06 , 150774.88 , 254256.25 ,
       131918.4  , 123984.164,  81119.33 , 219446.39 , 339649.1  ,
       189118.12 , 136041.05 , 252190.02 , 222695.1  , 138671.94 ,
       187317.67 , 197960.38 , 105991.47 , 142592.14 , 161067.02 ,
       119453.13 , 178809.4  , 122945.445, 184621.45 , 157694.67 ,
       127041.82 , 221805.52 , 191624.9  ,  97780.88 , 198829.06 ,
       149003.97 , 196802.38 , 198466.86 , 166110.72 , 140016.6  ,
       217570.83 , 140288.22 , 103878.5  , 286662.9  , 198380.05 ,
       122883.63 , 111900.375, 142994.12 , 325115.34 , 137444.83 ,
       401320.5  , 137989.22 , 185307.44 , 171132.61 , 122077.92 ,
       160469.86 , 343938.47 , 101774.164, 157750.28 , 141526.89 ,
        92223.26 , 217057.23 , 125965.69 , 235753.44 , 205229.44 ,
       214667.6  , 214259.08 ,  84141.766, 285339.47 , 147226.39 ,
       393136.4  , 126115.4  , 346828.8  , 271352.2  , 130352.77 ,
       227012.67 , 243817.61 , 206185.84 , 159335.69 , 280792.22 ,
       115778.69 , 217244.7  , 177977.02 , 184655.94 , 236719.53 ,
       121642.336, 159058.17 , 181294.78 , 138776.02 , 134631.75 ,
       217863.08 , 133672.4  , 524440.9  ], dtype=float32)
In [119]:
# R^2 score on the validation split
Xgboostscore=XgBoostmodel.score(A_test,b_test)
Xgboostscore
Out[119]:
0.9181719179745418
In [120]:
# Mean squared log error on the validation split
metrics.mean_squared_log_error(b_test, Xgboost_prediction_train)
Out[120]:
0.01902488650213129
In [121]:
from sklearn.model_selection import StratifiedKFold
In [122]:
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBRegressor
In [123]:
params = {
        'min_child_weight': [1,3,4,5,7,10],
        'gamma': [0.5, 1, 1.5, 2, 5,0.3],
        'subsample': [0.6, 0.8, 1.0],
        'colsample_bytree': [0.6, 0.8, 1.0],
        'max_depth': [3, 4, 5,6]
        }
In [124]:
# Note: `silent` is deprecated in newer xgboost releases in favor of `verbosity`,
# hence the "Parameters: { silent } might not be used" warning below.
xgbreg = XGBRegressor(learning_rate=0.02, n_estimators=600,
                    silent=True, nthread=1)
In [125]:
folds = 3
param_comb = 5
# Note: StratifiedKFold stratifies on class labels and is meant for classification;
# for a continuous target like SalePrice, plain KFold is the appropriate splitter
# (hence the "least populated class" warning below).
skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=42)
random_search = RandomizedSearchCV(xgbreg, param_distributions=params, n_iter=param_comb, scoring='neg_mean_squared_error', n_jobs=4, cv=skf.split(A_train,b_train), verbose=3, random_state=42)
In [126]:
random_search.fit(A_train, b_train)
A:\Anaconda\lib\site-packages\sklearn\model_selection\_split.py:657: Warning:

The least populated class in y has only 1 members, which is too few. The minimum number of members in any class cannot be less than n_splits=3.

[Parallel(n_jobs=4)]: Using backend LokyBackend with 4 concurrent workers.
Fitting 3 folds for each of 5 candidates, totalling 15 fits
[Parallel(n_jobs=4)]: Done  15 out of  15 | elapsed:    5.8s finished
A:\Anaconda\lib\site-packages\sklearn\model_selection\_search.py:813: DeprecationWarning:

The default of the `iid` parameter will change from True to False in version 0.22 and will be removed in 0.24. This will change numeric results when test-set sizes are unequal.

[19:39:59] WARNING: C:\Users\Administrator\workspace\xgboost-win64_release_1.2.0\src\learner.cc:516: 
Parameters: { silent } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.


Out[126]:
RandomizedSearchCV(cv=<generator object _BaseKFold.split at 0x0000016738660B10>,
                   error_score='raise-deprecating',
                   estimator=XGBRegressor(base_score=None, booster=None,
                                          colsample_bylevel=None,
                                          colsample_bynode=None,
                                          colsample_bytree=None, gamma=None,
                                          gpu_id=None, importance_type='gain',
                                          interaction_constraints=None,
                                          learning_rate=0.02,
                                          max_delta_step=None, max_depth...
                                          validate_parameters=None,
                                          verbosity=None),
                   iid='warn', n_iter=5, n_jobs=4,
                   param_distributions={'colsample_bytree': [0.6, 0.8, 1.0],
                                        'gamma': [0.5, 1, 1.5, 2, 5, 0.3],
                                        'max_depth': [3, 4, 5, 6],
                                        'min_child_weight': [1, 3, 4, 5, 7, 10],
                                        'subsample': [0.6, 0.8, 1.0]},
                   pre_dispatch='2*n_jobs', random_state=42, refit=True,
                   return_train_score=False, scoring='neg_mean_squared_error',
                   verbose=3)
In [127]:
print('\n Best hyperparameters:')
print(random_search.best_params_)
 Best hyperparameters:
{'subsample': 0.6, 'min_child_weight': 10, 'max_depth': 3, 'gamma': 2, 'colsample_bytree': 1.0}
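Note that the parameters found by the search are not automatically transferred to the `XgBoostmodel` fitted earlier; with `refit=True` (the default), the search object itself retrains on the full training data and exposes a ready-to-use `best_estimator_`. A self-contained sketch of that pattern using plain `KFold` (which avoids the stratification warning on a continuous target), with synthetic data and a `DecisionTreeRegressor` as a stand-in estimator:

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.tree import DecisionTreeRegressor

X_demo, y_demo = make_regression(n_samples=300, n_features=8, noise=15.0, random_state=42)

# KFold (not StratifiedKFold) is the usual splitter for a continuous target.
cv = KFold(n_splits=3, shuffle=True, random_state=42)
search = RandomizedSearchCV(
    DecisionTreeRegressor(random_state=42),
    param_distributions={"max_depth": [3, 4, 5, 6], "min_samples_leaf": [1, 5, 10]},
    n_iter=5, scoring="neg_mean_squared_error", cv=cv, random_state=42,
)
search.fit(X_demo, y_demo)

# refit=True retrains on all the data with the best parameters,
# so best_estimator_ can predict directly.
print(search.best_params_)
preds = search.best_estimator_.predict(X_demo[:3])
```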
In [128]:
# Getting the test dataset from the whole dataset
Xgboosttestdata=final_house_data[final_house_data["flag"]=='1']
In [129]:
# Removing the unwanted columns
testdata_xgboost=Xgboosttestdata.drop(['flag','SalePrice','Id'], axis=1)
In [130]:
# Making predictions on the test data set
predictions_xgboost_testdata=XgBoostmodel.predict(testdata_xgboost)
predictions_xgboost_testdata
Out[130]:
array([124113.586, 148851.6  , 180837.2  , ..., 146814.5  , 119905.58 ,
       225434.42 ], dtype=float32)
In [131]:
# Converting the prediction array to a DataFrame for CSV export
resultcsv=pd.DataFrame(predictions_xgboost_testdata)
In [132]:
resultcsv.shape
Out[132]:
(1459, 1)
In [133]:
resultcsv.to_csv('Result.csv')
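The frame exported above carries only a positional index, while Kaggle's submission format for this competition (an assumption here, not stated in the notebook) expects explicit `Id` and `SalePrice` columns. A hedged sketch with stand-in values for `House_data_test["Id"]` and the prediction array:

```python
import numpy as np
import pandas as pd

# Stand-ins for House_data_test["Id"] and predictions_xgboost_testdata.
test_ids = pd.Series([1461, 1462, 1463], name="Id")
preds = np.array([124113.586, 148851.6, 180837.2])

# Pair each test-house Id with its predicted price; index=False keeps
# the positional index out of the file.
submission = pd.DataFrame({"Id": test_ids, "SalePrice": preds})
submission.to_csv("submission.csv", index=False)
print(submission.columns.tolist())
```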

COMPARISON OF MODEL PERFORMANCES ON THE VALIDATION SPLIT

The XGBoost model performed best, with the highest validation R^2 (about 0.92).

In [134]:
# Storing each model's score in a dictionary and converting it to a DataFrame.
scores = {"Linear Regressor":[linearregressionscore],"DecisionTree Regressor":[decisiontreescore],"RandomForest Regressor":[randomforestscore],"XGBoost Regressor":[Xgboostscore]}
df_comparison_models=pd.DataFrame(scores,["Score"])
In [135]:
# Plotting the scores of all four models on the validation split.
%matplotlib inline
model_accuracy = pd.Series(data=[linearregressionscore,decisiontreescore,randomforestscore,Xgboostscore], 
        index=['Linear Regressor','DecisionTree Regressor','RandomForest Regressor','XGBoost Regressor'])
fig= plt.figure(figsize=(8,8))
model_accuracy.sort_values().plot.barh()
plt.title('Model Accuracy')
Out[135]:
Text(0.5, 1.0, 'Model Accuracy')

THAT'S ALL FOLKS!

This completes the SalePrice prediction for the given business problem. The Result.csv file was uploaded to Kaggle and ranked 107 out of roughly 45,000 entries. This kernel will be updated with further hyperparameter tuning to improve the model's accuracy.